This tutorial follows the vignettes written by Ben Schmidt to
illustrate his wordVectors package. See the introductory
and exploration
vignettes. See, too, his longer blog post
on vector space models for the humanities.
This tutorial walks through training a model on our Nobel corpus,
which is a very small corpus for word embedding. This is done
more as a demonstration of the wordVectors package than as
something likely to give us valuable insight. First we write our Nobel
corpus to a text file in our current directory. Then we
have wordVectors prep this file by tokenizing it,
lowercasing everything (we’ve already done this, but the
package doesn’t know that), and bundling commonly occurring bigrams.
Finally we train the model, which writes its output to another
file whose name we supply.
# install.packages("devtools")
# devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(magrittr)
library(tidyverse)
nobel <- read_rds("data/nobel_cleaned.Rds")
write_lines(nobel$AwardSpeech, "nobel.txt")
prep_word2vec(origin = "nobel.txt", destination = "nobel_prep.txt",
              lowercase = TRUE, bundle_ngrams = 2)
model <- train_word2vec("nobel_prep.txt", "nobel_vectors.bin",
                        vectors = 200, threads = 4, window = 10,
                        iter = 5, negative_samples = 10, force = TRUE)
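Training can take a while. Once the .bin file exists, a session can reload it rather than retraining; a minimal sketch, using wordVectors’ read.vectors():

```r
# Reload the trained vectors from disk instead of retraining
# (assumes "nobel_vectors.bin" was written by a previous run):
model <- read.vectors("nobel_vectors.bin")
```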
With our model trained, the most obvious thing to do is to look at individual words and see which other words are closest to them in terms of cosine similarity.
model %>% closest_to("peace", n = 15)
## word similarity to "peace"
## 1 peace 1.0000000
## 2 noble 0.6221961
## 3 congresses 0.6208113
## 4 prerequisite 0.6186160
## 5 realities 0.6091034
## 6 fraternity 0.5951835
## 7 secured 0.5921423
## 8 reconciliation 0.5900457
## 9 understanding 0.5874524
## 10 bjørnson 0.5821669
## 11 enduring 0.5805418
## 12 foundations 0.5805184
## 13 promotion 0.5803314
## 14 genuine 0.5780679
## 15 advancement 0.5711615
closest_to allows for easy vector addition and
subtraction. We can, for example, try the classic (and perhaps a bit
tired) example:
model %>% closest_to(~"king"+"woman"-"man")
## word similarity to "king" + "woman" - "man"
## 1 king 0.8287436
## 2 martin_luther 0.6857282
## 3 andrei_sakharov 0.6314907
## 4 carlos 0.6060730
## 5 nelson_mandela 0.6017250
## 6 visiting 0.5819703
## 7 literature 0.5804038
## 8 company 0.5725947
## 9 clinton 0.5610304
## 10 gentle 0.5602383
Well, that didn’t work! But we shouldn’t really be surprised: we’re using a tiny corpus, and one not likely to talk much about kings or queens. More meaningful for this sort of corpus might be:
model %>% closest_to(~"nuclear" + "peace")
## word similarity to "nuclear" + "peace"
## 1 nuclear 0.8696372
## 2 peace 0.7575096
## 3 test_ban 0.7009185
## 4 preventing 0.6819124
## 5 obama's 0.6798778
## 6 explosions 0.6730134
## 7 international_physicians 0.6665144
## 8 testing 0.6653571
## 9 ican's 0.6646791
## 10 ican 0.6543100
model %>% closest_to(~"nuclear" - "peace")
## word similarity to "nuclear" - "peace"
## 1 nuclear 0.7231207
## 2 atomic 0.4170524
## 3 explosions 0.4168727
## 4 test 0.4154024
## 5 nuclear_weapons 0.4098148
## 6 bomb 0.4070871
## 7 testing 0.4063369
## 8 warfare 0.3809506
## 9 bombs 0.3749656
## 10 nuclear_tests 0.3655815
A rough approximation of Kozlowski, Taddy, and Evans (2019) might be to construct a “cultural” vector (we’ll just use one binary pair and take the difference, rather than averaging over multiple pairs) and then measure other words’ cosine similarity to it, i.e., the extent to which they point in the direction of our vector (toward “peace”) or toward “violence” (a negative number; the lower the value, the closer to “violence”).
peace <- model[["peace"]]        # vector for "peace"
violence <- model[["violence"]]  # vector for "violence"
pv_spectrum <- peace - violence  # the peace-violence axis
cosineSimilarity(pv_spectrum, model[["treaty"]])
## [,1]
## [1,] 0.2830068
cosineSimilarity(pv_spectrum, model[["armistice"]])
## [,1]
## [1,] 0.08531787
cosineSimilarity(pv_spectrum, model[["violation"]])
## [,1]
## [1,] -0.146485
cosineSimilarity(pv_spectrum, model[["aggression"]])
## [,1]
## [1,] -0.3525665
cosineSimilarity(pv_spectrum, model[["war"]])
## [,1]
## [1,] -0.1732886
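The repeated cosineSimilarity() calls above can be consolidated into one pass. A sketch, assuming every word in the list is in the model’s vocabulary:

```r
# Compare several words against the peace-violence axis at once;
# sapply() simplifies the 1x1 similarity matrices to a named vector.
words <- c("treaty", "armistice", "violation", "aggression", "war")
sapply(words, function(w) cosineSimilarity(pv_spectrum, model[[w]]))
```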
All told and for such a small corpus, this seems not half bad.
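To make clear what cosineSimilarity() is doing under the hood, here is a toy base-R illustration with made-up three-dimensional “embeddings” (the vectors and the cosine() helper are hypothetical, for demonstration only):

```r
# Cosine similarity: the dot product of two vectors divided by
# the product of their norms.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

peace_toy    <- c(1, 0.8, 0)      # made-up 3-d embedding
violence_toy <- c(-0.9, 0.1, 0.4) # made-up 3-d embedding
axis_toy     <- peace_toy - violence_toy

cosine(axis_toy, peace_toy)     # positive: points toward "peace"
cosine(axis_toy, violence_toy)  # negative: points toward "violence"
```

Words whose vectors lean toward the “peace” end of the axis score positive, and words leaning toward “violence” score negative, which is exactly how we read the results above.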
We might also try to plot this cultural axis using two binary opposite word pairs, and then see where other words land relative to the difference within each pair (similar to what we did above). To subset our total vocabulary, we’ll plot the 200 words most similar to “politics”.
violation <- model %>%
  closest_to(~ "violation" - "rights", n = Inf)
war <- model %>%
  closest_to(~ "war" - "peace", n = Inf)
politics <- model %>%
  closest_to("politics", n = 200)

politics %>%
  inner_join(violation) %>%
  inner_join(war) %>%
  ggplot() +
  geom_text(aes(x = `similarity to "violation" - "rights"`,
                y = `similarity to "war" - "peace"`,
                label = word))
wordVectors includes multiple nice plotting features.
One is via principal component analysis, which reduces many dimensions
to a smaller number (here two) based on the two most informative
dimensions running through the original many-dimensional space. Here
we’ll try to compare words grouped around “peace”.
peacewords <- model %>% closest_to("peace", n = 50)
peace_vectors <- model[[peacewords$word, average = FALSE]]
plot(peace_vectors, method = "pca")
Or we can use t-SNE, another dimensionality reduction method, to project our word vectors onto two-dimensional space.
plot(model,perplexity=50)
This definitely shows words that tend to show up together. There are perhaps some interesting things here, though historians are likely to find graphs like this most interesting when compared over time.